A CFG is said to be in Greibach Normal Form (GNF) if all its rules are of the following form:
$A \to a A_1 A_2 \cdots A_k$, where $A, A_1, \dots, A_k$ are nonterminals, $a$ is some terminal, and $k \ge 0$.
And if $\varepsilon \in L(G)$, then we can include
$S \to \varepsilon$, provided $S$ does not appear on the right hand side of any rule.
Note that in GNF no left recursion can occur. This will be helpful for some top-down parsers.
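As a small illustration (using a hypothetical toy encoding, not from these notes: a grammar is a dict mapping each nonterminal to a list of right-hand sides, each a tuple of symbols; lowercase strings are terminals, uppercase are nonterminals), the GNF shape can be checked mechanically:

```python
def is_gnf(rules, start):
    """Check the GNF shape: every rule is A -> a B1...Bk; the only allowed
    epsilon-rule is for the start symbol, which must not occur on any RHS."""
    for lhs, rhss in rules.items():
        for rhs in rhss:
            if rhs == ():  # epsilon-rule
                if lhs != start:
                    return False
                # the start symbol must not appear on any right hand side
                if any(start in r for rs in rules.values() for r in rs):
                    return False
            else:
                head, *tail = rhs
                if not head.islower():                    # first symbol: terminal
                    return False
                if any(not s.isupper() for s in tail):    # rest: nonterminals
                    return False
    return True

g = {"S": [("a", "A", "B"), ("b",)], "A": [("a",)], "B": [("b", "B"), ("b",)]}
print(is_gnf(g, "S"))                 # True
print(is_gnf({"S": [("S", "a")]}, "S"))  # False: left-recursive rule
```

Note how a left-recursive rule like $S \to Sa$ immediately fails the check, since its first symbol is not a terminal.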
To show that any CFG can be converted to GNF, we will construct an elegant mapping with the help of the least fixed point property.
Least Fixed Point
Let $(P, \le)$ be a partially ordered set. We say a sequence $x_0, x_1, x_2, \dots$ is an $\omega$-chain if $x_n \le x_{n+1}$ for all $n \ge 0$. When an $\omega$-chain has a least upper bound, we denote it by $\bigsqcup_n x_n$.
We say $P$ is a complete poset iff it has a least element $\bot$ and every $\omega$-chain has a least upper bound in $P$.
Now let $P$ and $Q$ be complete posets. A function $f: P \to Q$ is said to be $\omega$-continuous iff
it is monotonic, i.e. if $x \le y$, then $f(x) \le f(y)$
$f(\bigsqcup_n x_n) = \bigsqcup_n f(x_n)$, where $x_0 \le x_1 \le \cdots$ is any $\omega$-chain
Theorem 1 (Least Fixed Point): Let $P$ be a complete poset. Then every $\omega$-continuous function $f: P \to P$ has a least fixed point.
Proof: Let $\bot$ be the least element. We will use $f^0(\bot) = \bot$ and $f^{n+1}(\bot) = f(f^n(\bot))$.
The trick is to construct an $\omega$-chain as follows: $\bot \le f(\bot) \le f^2(\bot) \le \cdots$
To show it is an $\omega$-chain, note that $f$ is $\omega$-continuous, hence it is monotonic. Since $\bot$ is the least element, we have $\bot \le f(\bot)$. Since $f$ is monotonic, we have $f(\bot) \le f^2(\bot)$. Continuing by induction on $n$, we prove the chain is an $\omega$-chain.
Since $P$ is a complete poset, the chain has a least upper bound in $P$, denoted by $x^* = \bigsqcup_n f^n(\bot)$.
$x^*$ is the least fixed point we are looking for. To see this, recall $f$ is $\omega$-continuous. Thus $f(x^*) = f(\bigsqcup_n f^n(\bot)) = \bigsqcup_n f^{n+1}(\bot) = x^*$, which shows $x^*$ is a fixed point. For any fixed point $y$, since $\bot \le y$ we have $f^n(\bot) \le f^n(y) = y$ for any $n$. Thus $x^* \le y$. This proves $x^*$ is the least fixed point.
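The proof's construction is effectively an algorithm when the poset is finite. A minimal sketch (the graph, the seed set, and the map `f` below are made-up examples): start from the least element and apply `f` until nothing changes.

```python
def least_fixed_point(f, bottom=frozenset()):
    """Kleene iteration: bottom <= f(bottom) <= f(f(bottom)) <= ...
    On a finite poset the chain stabilizes at the least fixed point."""
    x = bottom
    while True:
        nxt = f(x)
        if nxt == x:
            return x
        x = nxt

# Example monotone map on subsets: nodes reachable from seed {1} in a graph.
edges = {1: {2}, 2: {3}, 3: {3}, 4: {1}}
f = lambda x: frozenset({1}) | frozenset(s for n in x for s in edges.get(n, ()))
print(sorted(least_fixed_point(f)))  # [1, 2, 3]
```

Note that node 4 is not in the result: the least fixed point contains only what the iteration forces in, which is exactly the "least" part of the theorem.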
Derivations at “Equilibrium”
For a CFG $G$, let $X_1, \dots, X_n$ be its nonterminals. We can rearrange its production rules and group them by nonterminals as:
$$X_1 \to \alpha_{11} + \alpha_{12} + \cdots + \alpha_{1m_1}$$
$$\cdots$$
$$X_n \to \alpha_{n1} + \alpha_{n2} + \cdots + \alpha_{nm_n} \quad (1)$$
where each $\alpha_{ij} \in V^*$ and "$+$" means "or". For example, if we have two rules $X \to aY$ and $X \to b$, we can write $X \to aY + b$. This notational simplicity later turns out to be an elegant way to represent CFG derivation by mimicking matrix multiplication and addition.
If we start deriving from some nonterminal, say $X_i$, what would the derived language $L(X_i)$ look like? And how are its elements added to the set gradually?
At step 0, where no nonterminal is rewritten, $L(X_i)$ can only include the right-hand sides that contain no nonterminals, i.e. some terminal strings. The same goes for every other nonterminal. So "after step 0", the language derived from each nonterminal contains some initial strings only. Let's move 1 step further and look at $L(X_i)$ again. Now the nonterminals in its rules start to "unfold". For example, if $X_j$ appears in some $\alpha_{ik}$, then it will unfold to all possible strings derived for $X_j$ in step 0. The same goes for all the other nonterminals.
So at each step, the languages grow bigger and bigger. You may feel that there is some "limit" for how large $L(X_i)$ can get, some "equilibrium" past which all the $L(X_i)$ "stop" deriving new strings. In the limit, the final tuple of languages is some sort of "fixed point".
The above idea is exactly what we will construct next, utilizing the least fixed point theorem.
Formally, let us set up notation and definitions.
Let $V$ be the set of symbols and $\Sigma \subseteq V$ be the set of terminals. We use $X_1, \dots, X_n$ to denote the nonterminals. Note a language is a subset of strings in $\Sigma^*$. Similarly, we use a letter such as $\alpha$ to denote a finite string of nonterminals and terminals, i.e. $\alpha \in V^*$.
For any tuple of languages $\mathbf{L} = (L_1, \dots, L_n)$ we define a function $\alpha \mapsto \alpha(\mathbf{L}) \subseteq \Sigma^*$ as follows:
$a(\mathbf{L}) = \{a\}$, for all terminals $a$ (and $\varepsilon(\mathbf{L}) = \{\varepsilon\}$)
$X_i(\mathbf{L}) = L_i$, where $X_i$ is a nonterminal
$(\alpha \cdot \beta)(\mathbf{L}) = \alpha(\mathbf{L}) \cdot \beta(\mathbf{L})$, where the dot operation means concatenating every pair of strings from the two sets. Sometimes we omit the dot sign for simplicity.
$(\alpha + \beta)(\mathbf{L}) = \alpha(\mathbf{L}) + \beta(\mathbf{L})$, where the plus operation means "or", i.e. we union the 2 sets
You can easily verify that $\alpha(\mathbf{L})$ is a well defined function. Now we use it to define a mapping $\Phi$ on tuples of languages. Assume the rules in grammar $G$ are written in the form of (1) above, and let
$$\Phi(L_1, \dots, L_n) = \big((\alpha_{11} + \cdots + \alpha_{1m_1})(\mathbf{L}),\ \dots,\ (\alpha_{n1} + \cdots + \alpha_{nm_n})(\mathbf{L})\big) \quad (2)$$
The definition in (2) is a formal way to specify what we have discussed. Given a tuple of languages, we use it to expand all nonterminals in the rules and thus obtain a new tuple of languages.
So if we start with $(\emptyset, \dots, \emptyset)$ and apply $\Phi$, we will obtain $\Phi(\emptyset, \dots, \emptyset)$, where each component is the language derived from the corresponding $X_i$ after 1 step. We keep doing this and eventually obtain $\bigsqcup_k \Phi^k(\emptyset, \dots, \emptyset)$. As you can guess, its components are exactly the languages derived from the nonterminals (in particular, the component for the start symbol is $L(G)$), and together they are exactly the least fixed point of $\Phi$.
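This iteration can be simulated directly. A hedged sketch (same hypothetical toy grammar encoding as before; we cap string length so every iterate stays a finite set, which only approximates the true least fixed point):

```python
from itertools import product

def step(rules, langs, max_len=6):
    """One application of the map Phi: rewrite every nonterminal occurrence
    in every rule with the strings currently in its language."""
    new = {}
    for A, rhss in rules.items():
        out = set()
        for rhs in rhss:
            if rhs == ():              # epsilon alternative contributes ""
                out.add("")
                continue
            # each symbol contributes itself (terminal) or its current language
            choices = [[s] if s.islower() else sorted(langs[s]) for s in rhs]
            for combo in product(*choices):
                w = "".join(combo)
                if len(w) <= max_len:
                    out.add(w)
        new[A] = out
    return new

rules = {"S": [(), ("a", "S", "b")]}   # S -> eps | a S b
langs = {"S": set()}                   # start from the least element
for _ in range(4):
    langs = step(rules, langs)
print(sorted(langs["S"], key=len))  # ['', 'ab', 'aabb', 'aaabbb']
```

Each iterate adds the strings derivable in one more step, exactly the "equilibrium" picture described above: $\{a^k b^k\}$ grows one level per application of the map.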
We shall prove three lemmas, which will lead to our desired result in a theorem.
Note for any set $S$, $\subseteq$ is a partial order on the subsets of $S$: $U \le W$ iff $U \subseteq W$. So in our case, a partial order on tuples of languages is defined componentwise: $(L_1, \dots, L_n) \le (L'_1, \dots, L'_n)$ iff $L_i \subseteq L'_i$ for every $i$.
Lemma 2: $\Phi$ is an $\omega$-continuous function.
Proof: Let $\mathbf{L} = (L_1, \dots, L_n)$ and $\mathbf{L}' = (L'_1, \dots, L'_n)$. Suppose $\mathbf{L} \le \mathbf{L}'$; we want to show $\Phi(\mathbf{L}) \le \Phi(\mathbf{L}')$. Now consider their $i$-th components: $(\alpha_{i1} + \cdots + \alpha_{im_i})(\mathbf{L})$ and $(\alpha_{i1} + \cdots + \alpha_{im_i})(\mathbf{L}')$.
It suffices to show that for every $\alpha_{ij}$, we have
$$\alpha_{ij}(\mathbf{L}) \subseteq \alpha_{ij}(\mathbf{L}') \quad (\ast)$$
Let $\alpha_{ij} = c_1 c_2 \cdots c_k$, where $c_l \in V$ for $1 \le l \le k$. Note $k = 0$ means this is an $\varepsilon$-rule; then clearly $(\ast)$ holds. If $k \ge 1$, then $\alpha_{ij}(\mathbf{L}) = c_1(\mathbf{L}) \cdots c_k(\mathbf{L})$. If $c_l$ is a terminal, then $c_l(\mathbf{L}) = c_l(\mathbf{L}')$. If $c_l$ is a nonterminal $X_t$, then $c_l(\mathbf{L}) = L_t$ and $c_l(\mathbf{L}') = L'_t$. By assumption, we have $L_t \subseteq L'_t$. Thus, we conclude that $(\ast)$ holds and $\Phi$ is monotonic.
Next, consider any $\omega$-chain $\mathbf{L}^{(0)} \le \mathbf{L}^{(1)} \le \cdots$. Let $\mathbf{L}^* = \bigsqcup_k \mathbf{L}^{(k)}$ (the componentwise union). We need to show $\Phi(\mathbf{L}^*) = \bigsqcup_k \Phi(\mathbf{L}^{(k)})$. This is to show that for every $i$, their $i$-th components are equal. Let $A$ be the $i$-th component of $\Phi(\mathbf{L}^*)$ and $B$ be the $i$-th component of $\bigsqcup_k \Phi(\mathbf{L}^{(k)})$. Then we need to show $A \subseteq B$ and $B \subseteq A$.
First, consider $w \in A$. There must be some $\alpha_{ij}$ such that $w \in \alpha_{ij}(\mathbf{L}^*)$. If we inspect $\alpha_{ij}(\mathbf{L}^*)$ closely, the set is obtained by rewriting each nonterminal in $\alpha_{ij}$ with the strings in its corresponding language in $\mathbf{L}^*$, and $w$ is in this set. Since the chain is increasing and $\alpha_{ij}$ contains only finitely many nonterminal occurrences, each occurrence is rewritten by a string that already appears at some finite stage; taking a stage $k$ large enough for all of them, $w \in \alpha_{ij}(\mathbf{L}^{(k)})$. This shows $A \subseteq B$.
Conversely, if $w \in B$, there is some $k$ and some $\alpha_{ij}$ such that $w \in \alpha_{ij}(\mathbf{L}^{(k)})$. Then $\alpha_{ij}(\mathbf{L}^{(k)}) \subseteq \alpha_{ij}(\mathbf{L}^*)$. Hence $w \in A$. This completes the proof.
By Theorem 1 (Least fixed point) and Lemma 2, $\Phi$ has a least fixed point. It is constructed as $\bigsqcup_k \Phi^k(\bot)$.
Note $\bot = (\emptyset, \dots, \emptyset)$ is the least element. Thus the least fixed point of $\Phi$ is $\bigsqcup_k \Phi^k(\emptyset, \dots, \emptyset)$.
Now we want to prove that its $i$-th component equals $L(X_i) = \{w \in \Sigma^* : X_i \Rightarrow^* w\}$ for all $i$. This is established by the following two lemmas.
Lemma 3: For all $k$ and $i$, $\Phi^k(\bot)_i \subseteq \{w \in \Sigma^* : X_i \Rightarrow^* w\}$.
Proof: We prove by induction on $k$. For $k = 1$, note $w \in \Phi(\bot)_i$ if and only if $X_i \to w$ is a production rule with $w \in \Sigma^*$. Thus, $w$ can be derived from $X_i$ in one step and we clearly have the inclusion.
Assume it holds for $k$, for every $i$. Consider $w \in \Phi^{k+1}(\bot)_i$. Note $\Phi^{k+1}(\bot) = \Phi(\Phi^k(\bot))$. So $w \in \alpha_{ij}(\Phi^k(\bot))$ for some $\alpha_{ij}$. We expand $\alpha_{ij} = c_1 \cdots c_m$, where each $c_l$ is a terminal or a nonterminal. Then $w = w_1 \cdots w_m$ with $w_l \in c_l(\Phi^k(\bot))$. If $c_l$ is a terminal, then $w_l = c_l$. If $c_l$ is a nonterminal $X_t$, then by the induction assumption $X_t$ derives $w_l$ in finitely many steps. Thus $\alpha_{ij}$ can derive $w$, and hence $X_i \to \alpha_{ij} \Rightarrow^* w$, i.e. $w \in \{w : X_i \Rightarrow^* w\}$.
Lemma 4: For all $i$, $\{w \in \Sigma^* : X_i \Rightarrow^* w\} \subseteq \bigsqcup_k \Phi^k(\bot)_i$.
Proof: We prove by induction on derivation length $p$. If $p = 1$, then there must be a production rule $X_i \to w$. Hence $w \in \Phi(\bot)_i$. Now assume it holds for derivation lengths $< p$, for all $i$. Suppose $X_i$ derives $w$ in $p$ steps. There must be some rule $X_i \to c_1 \cdots c_m$ from which $w$ is reached. So in the derivation path there must exist a sub-path for each nonterminal $c_l$ (note we can always turn a derivation path into an equivalent leftmost derivation) such that $c_l$ derives some $w_l$, and eventually $w = w_1 \cdots w_m$. Note the length of each such sub-path is less than $p$. Thus by the induction assumption, each $w_l \in \Phi^{k_l}(\bot)$'s corresponding component for some $k_l$. Now take $k = \max_l k_l$. Then we have $w \in \Phi^{k+1}(\bot)_i$.
Finally, with all the discussion above, we arrive at
Theorem 5: $(L(X_1), \dots, L(X_n))$ is the least fixed point of $\Phi$.
Convert to Weak Greibach Normal Form
It is not difficult to see that if we are able to bring every rule into the form $A \to a\alpha$, where $a$ is a terminal and $\alpha$ can be any string over $V$, then we will be able to further convert the grammar to the strong Greibach Normal Form mentioned at the beginning of this note. The above form is called Weak Greibach Normal Form. As long as the first symbol is a terminal, we eliminate the possibility of left recursion. To achieve this, we need to introduce more "notation tricks" to represent production rules as matrix operations.
For a CFG $G$, as above we assume it contains nonterminals $X_1, \dots, X_n$. We can group productions by their left-hand-side nonterminals as in (1).
Furthermore, on the right hand side, we can group the alternatives by their first symbol, i.e. collect for each $X_i$ the alternatives that begin with a nonterminal $X_j$ and those that begin with a terminal.
If we write it in "matrix" form, (1) will become $X = XA + B$, where $X$ is the row vector $(X_1, \dots, X_n)$, $A$ is an $n \times n$ matrix and each of its entries $A_{jk}$ is a (possibly empty) finite subset of $V^*$. Note if an entry is the empty set, then concatenating with it yields $\emptyset$. Finally, $B$ is also a row vector where each entry is a finite set of strings that start with a terminal symbol.
So the set of production rules can be expressed as a matrix equation, treating the nonterminals as variables:
$$X = XA + B \quad (3)$$
Languages over $V$ (or $\Sigma$) form a semiring with
Addition $+$: the union operation, i.e. $U + W = U \cup W$
Multiplication $\cdot$: concatenation, i.e. $U \cdot W = \{uw : u \in U,\ w \in W\}$
Additive identity $0$: this is the empty set $\emptyset$
Multiplicative identity $1$: this is $\{\varepsilon\}$
For a matrix $A$ over this semiring, we let
$$A^* = I + A + A^2 + A^3 + \cdots \quad (4)$$
where $I$ is the identity matrix (with $\{\varepsilon\}$ on the diagonal and $\emptyset$ elsewhere).
We look at (3) again; in fact we can let $\Phi(X) = XA + B$. Theorem 5 says that the tuple of languages derived from the nonterminals is the least fixed point of $\Phi$.
You can also find out that the least fixed point is actually derived by applying $\Phi$ an infinite number of times. Be careful that in (4), $A$ and $B$ can also contain nonterminals; when we apply $\Phi$ repeatedly, we substitute into those occurrences as well.
Now if we assume $A$ and $B$ do not contain any nonterminal, then the least fixed point of $\Phi$ can be found by following the procedure in Theorem 5. First, applying $\Phi$ once to $\emptyset$ we have $\Phi(\emptyset) = B$. Applying it twice yields $\Phi^2(\emptyset) = B + BA$. Applying it $k$ times yields $B + BA + \cdots + BA^{k-1}$. Thus, letting $k \to \infty$, we get the least fixed point as $BA^*$.
Lemma 6: A grammar whose production rules are expressed as $X = XA + B$, where $A$ and $B$ do not contain any nonterminal, has least fixed point $BA^*$.
Proof: It has been shown by the computation above.
Now the magic to turn any grammar into weak Greibach form is to create new nonterminals $Y$ (an $n \times n$ matrix of fresh nonterminals $Y_{jk}$) and rewrite the production rules from $X = XA + B$ to $X = B + BY$ and $Y = A + AY$. Note $Y$ in the above is a matrix.
We call this new grammar $G'$. We let $L$ denote the least fixed point of $G$, and $(L', M)$ denote the least fixed point of $G'$ (where $L'$ corresponds to the $X$ components, and $M$ to the $Y$ components). The elegance is that $L = L'$, and hence $G$ and $G'$ produce the same language. After this operation, the rules for $X$ in $G'$ all start with a terminal (we may assume no $\varepsilon$-rule or chain rule exists by first converting to CNF, so every alternative in $B$ starts with a terminal). To further make the rules for $Y$ all start with a terminal symbol, notice that the entries of $A$ only begin with nonterminals among the $X_i$. So if $L$ is the fixed point, we simply substitute for each leading $X_i$ in $A$ the right-hand sides of $X_i$'s rules in $G'$ (which all start with terminals), and thus $G'$ would meet our needs.
Now it remains to prove that $L = L'$.
Theorem 7: Let $G$ be expressed as
$$X = XA + B$$
and $G'$ be expressed as
$$X = B + BY, \qquad Y = A + AY.$$
If $L$ is the least fixed point of $G$ and $(L', M)$ is the least fixed point of $G'$, then we have $L = L'$.
Proof: Since is the least fixed point of , we have . Because contains no nonterminals, by Lemma 6 the least fixed point of is . Therefore, (since by (1) is also a solution to is ).
Note the functions are both monotonic; by (2) we have
In the least fixed point theorem, for any with we have . So if we let , then (3) is showing with . Therefore, . Combining (2) and (4) we get
Similarly for , since is the least fixed point we have
Again using the trick of Lemma 6 and the fact that does not contain any nonterminal, we have the least fixed point of is . But (6) exactly states that is also a solution. So we have
Now notice . Again we have and and . Since is the least fixed point of , we have . Combining (7) and (8) we have
To proceed, observe that is a solution to . In fact, is true because by (5) we have . Also is an identity. Since is the least fixed point of , we must have .
On the other hand, using first half of (6) and result of (9) we get . We can see that is a solution to by (10). In fact,
. Since is the least fixed point of , we must have . This completes the proof.
A CFG is said to be in Chomsky Normal Form (CNF), if all its productions are only of the following forms:
$A \to BC$, where $A, B, C$ are non-terminals
$A \to a$, where $a$ is a terminal
$S \to \varepsilon$ if $\varepsilon \in L(G)$, and then $S$ never occurs on the right hand side of any production rule
The above is saying you either produce 2 non-terminals, or produce 1 terminal. There will be no "$\varepsilon$-rules" and no "chain rules". The only possible $\varepsilon$-production is directly from the start symbol.
We can build an algorithm to convert any CFG to an equivalent grammar in CNF. The idea is to do it in 3 steps: (1) eliminate $\varepsilon$-rules; (2) eliminate chain rules; (3) restrict all right hand sides to either 2 non-terminals or 1 terminal.
Eliminate -rules
For a non-terminal $A$, if there is some chain $A \Rightarrow \cdots \Rightarrow \varepsilon$ that eventually leads to $\varepsilon$, then we call $A$ "erasable". If we eliminate all $\varepsilon$-rules, then we need to effectively eliminate all "erasable chains" as well.
Formally, for a CFG $G$ we use $E$ to denote the set of erasable non-terminals. We construct $E$ inductively as follows:
$$E_0 = \{A : A \to \varepsilon \text{ is a rule}\}, \qquad E_{k+1} = E_k \cup \{A : A \to \alpha \text{ is a rule for some } \alpha \in E_k^*\}$$
We can see $E_0 \subseteq E_1 \subseteq E_2 \subseteq \cdots$. We then let $E = \bigcup_k E_k$. Since $G$ only has finitely many non-terminals, there must be a max $k$ such that $E = E_k$. We can formally prove $E$ contains exactly the erasable non-terminals. The following theorem gives a procedure to formally remove all $\varepsilon$-rules while letting the new grammar generate the same language.
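The inductive construction of $E$ is a straightforward fixed-point loop. A sketch under the same hypothetical grammar encoding (rules as tuples of symbols, `()` for an $\varepsilon$-rule):

```python
def erasable(rules):
    """Compute E = set of erasable nonterminals: E_0 holds those with a direct
    epsilon-rule; each pass adds any A with a rule whose RHS lies entirely
    inside the current E. Stops when no pass adds anything (the fixed point)."""
    E = set()
    changed = True
    while changed:
        changed = False
        for A, rhss in rules.items():
            # all(... for s in ()) is True, so A -> eps puts A into E_0
            if A not in E and any(all(s in E for s in rhs) for rhs in rhss):
                E.add(A)
                changed = True
    return E

rules = {"S": [("A", "B")], "A": [()], "B": [("A",), ("b",)]}
print(sorted(erasable(rules)))  # ['A', 'B', 'S']
```

Here $A$ enters $E_0$ via $A \to \varepsilon$, then $B$ via $B \to A$, and finally $S$ via $S \to AB$: exactly the chain of "erasable chains" the text describes.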
Theorem 1 (Eliminate $\varepsilon$-rules): For any CFG $G$, there is an equivalent $G'$ such that
there are no rules of the form $A \to \varepsilon$,
except that if $\varepsilon \in L(G)$, the only possible such rule is $S' \to \varepsilon$, and $S'$ does not appear on the right hand side of any rule.
Proof: If $\varepsilon \in L(G)$, we create a new start symbol $S'$ and add $S' \to S$ and $S' \to \varepsilon$ to the rules. This makes sure that $S'$ will never occur on the right hand side of any rule.
We proceed to eliminate all $\varepsilon$-rules. So the new rule set contains, as its subset, all old rules $A \to \alpha$ with $\alpha \ne \varepsilon$.
Now we need to add some new rules to maintain the "erasability" of original symbols. This is done by constructing a set of new rules as follows: for every rule $A \to \alpha$ and every way of deleting from $\alpha$ some (but not all) occurrences of non-terminals in $E$, add the rule $A \to \alpha'$ for the resulting nonempty string $\alpha'$.
So the new rule set would be the retained old rules together with these new rules, plus $S' \to S \mid \varepsilon$ if $\varepsilon \in L(G)$. This completes the proof.
Note in the above, some $\alpha'$ can still contain erasable non-terminals. You should convince yourself that the construction indeed includes all possible new rules.
Eliminate Chain Rules
Given any CFG , with Theorem 1 we obtain an equivalent with no -rules. The next step is to eliminate chain rules.
Here we construct inductively, for a non-terminal $A$, its set of chainable non-terminals as
$$C_0(A) = \{A\}, \qquad C_{k+1}(A) = C_k(A) \cup \{C : B \to C \text{ is a rule},\ B \in C_k(A),\ C \text{ a non-terminal}\}$$
So $C_0(A) \subseteq C_1(A) \subseteq \cdots$. There is a max $k$ such that $C(A) = C_k(A)$.
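The chain sets can be computed the same way. A sketch (same toy encoding; a right-hand side consisting of a single uppercase symbol is treated as a unit rule):

```python
def chain(rules, A):
    """Compute C(A): nonterminals reachable from A through unit rules B -> C.
    Implemented as a closure/worklist version of the inductive definition."""
    seen = {A}
    frontier = [A]
    while frontier:
        B = frontier.pop()
        for rhs in rules.get(B, []):
            if len(rhs) == 1 and rhs[0].isupper() and rhs[0] not in seen:
                seen.add(rhs[0])
                frontier.append(rhs[0])
    return seen

rules = {"S": [("A",), ("a", "b")], "A": [("B",)], "B": [("b",)]}
print(sorted(chain(rules, "S")))  # ['A', 'B', 'S']
```

The worklist visits each unit rule at most once, so this runs in time linear in the number of rules, even though the inductive definition suggests repeated passes.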
Theorem 2 (Eliminate Chain Rules): For any CFG $G$ with no $\varepsilon$-rules, there is an equivalent $G'$ such that, except for the possible $S \to \varepsilon$, all rules $A \to \alpha$ satisfy:
$\alpha$ is not a single non-terminal; i.e. if $\alpha$ contains some non-terminal, then $|\alpha| \ge 2$.
Proof: This is to say we first eliminate all unit rules $A \to B$ where the right hand side is a single non-terminal $B$. Then we need to add back some rules to maintain the equivalence.
For a non-terminal $A$, consider its chain set $C(A)$. For each $B \in C(A)$, if there is some rule $B \to \alpha$ where either $|\alpha| \ge 2$ or $\alpha$ is a single terminal, then we add a new rule $A \to \alpha$.
Final Transformation
After eliminating $\varepsilon$-rules and chain rules, all the rules are of the forms:
$S \to \varepsilon$, if $\varepsilon \in L(G)$; then $S$ does not appear on the right hand side of any rule
$A \to \alpha$, where $\alpha$ is a single terminal, or $|\alpha| \ge 2$.
Theorem 3: For any CFG , there is an equivalent in Chomsky Normal Form.
Proof: We assume $G$ has no $\varepsilon$-rules and no chain rules. First we deal with terminal productions. For a rule $A \to \alpha$, if $\alpha$ is a single terminal we retain it. If $\alpha = a_1 a_2 \cdots a_n$ consists of terminals only but has length $n \ge 2$, then we do:
create new non-terminals $P_1, \dots, P_{n-1}$ and $P'_1, \dots, P'_{n-1}$,
and create the following new rules:
$A \to P'_1 P_1$, $\quad P_i \to P'_{i+1} P_{i+1}$ for $1 \le i \le n-2$, $\quad P_{n-1} \to a_n$,
$P'_1 \to a_1$, …,
$P'_{n-1} \to a_{n-1}$
Now if in $A \to \alpha$ the string $\alpha$ contains some non-terminals, then we do:
assume $\alpha = u_0 B_1 u_1 B_2 u_2 \cdots B_k u_k$, where the $B_i$ are non-terminals and the $u_i$ are all terminal strings (possibly empty);
create new non-terminals $U_i$ and add rules $U_i \to u_i$. For each such rule, repeat the above process to transform it into CNF form. (If some $u_i$ is empty, then ignore it.)
consider the rule $A \to U_0 B_1 U_1 \cdots B_k U_k$. We create new non-terminals $Q_1, Q_2, \dots$ and new rules binarizing the right hand side:
$A \to U_0 Q_1$, $Q_1 \to B_1 Q_2$, …, with the last rule producing the final two symbols.
CYK Parser
For a CFG $G$ in Chomsky Normal Form, a CYK parser can both recognize and parse any string $w$. Its worst case takes $O(n^3 \cdot |G|)$ time, where $n = |w|$. In practice, almost no grammar is designed in CNF. And even though we can transform a grammar to CNF, a CYK parser can only parse it into a CNF tree, not its original parse tree. Thus, CNF and CYK parsers are not popular in practice. We will not note any further details here.
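For completeness, here is a minimal CYK recognizer sketch (hypothetical toy CNF encoding as before; this shows the standard $O(n^3)$ table-filling idea, not code from the notes):

```python
def cyk(rules, start, w):
    """CNF recognizer. T[(i, l)] = set of nonterminals deriving w[i:i+l];
    the triple loop over (length, start, split point) fills it bottom-up."""
    n = len(w)
    if n == 0:
        return () in rules.get(start, [])
    T = {}
    for i, c in enumerate(w):  # length-1 substrings: terminal rules A -> c
        T[(i, 1)] = {A for A, rs in rules.items() if (c,) in rs}
    for l in range(2, n + 1):
        for i in range(n - l + 1):
            cell = set()
            for k in range(1, l):  # split w[i:i+l] into w[i:i+k] + w[i+k:i+l]
                for A, rs in rules.items():
                    for rhs in rs:
                        if (len(rhs) == 2 and rhs[0] in T[(i, k)]
                                and rhs[1] in T[(i + k, l - k)]):
                            cell.add(A)
            T[(i, l)] = cell
    return start in T[(0, n)]

# S -> AB, A -> a, B -> b generates exactly {"ab"}
rules = {"S": [("A", "B")], "A": [("a",)], "B": [("b",)]}
print(cyk(rules, "S", "ab"), cyk(rules, "S", "ba"))  # True False
```

Storing the chosen split point and rule in each cell would turn this recognizer into a parser producing a CNF tree, matching the caveat above about trees.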
$\Sigma$ is the set of terminal symbols. It is also the alphabet.
$R$ is the set of production rules. It is a relation on $N \times V^*$, where $N$ is the set of non-terminals and $V$ the set of all symbols. A production rule is something like $A \to \alpha$, where $\alpha$ can contain non-terminal or terminal symbols.
$S$ is the start symbol.
In simple words, a CFG always starts with $S$ and at each step rewrites a non-terminal symbol to some string according to some production rule, until all symbols are terminals.
We say a string (or sentence) $w \in \Sigma^*$ is accepted by a CFG $G$ iff $S \Rightarrow^* w$, i.e. $S$ derives $w$. The language accepted by $G$ is denoted by $L(G)$.
We can define: "$\alpha$ derives $\beta$", written $\alpha \Rightarrow^* \beta$, iff there is a finite chain of strings $\alpha_0, \alpha_1, \dots, \alpha_n$ such that $\alpha_0 = \alpha$, $\alpha_n = \beta$, and $\alpha_i \Rightarrow \alpha_{i+1}$ for $0 \le i < n$.
In the above, we still need to define what we mean by $\alpha \Rightarrow \beta$. In general, we say $\alpha \Rightarrow \beta$ to mean $\alpha$ derives $\beta$ in one step, satisfying:
$\alpha = u A v$, where $u, v \in V^*$ (note they can be empty), and $A$ is a non-terminal
$\beta = u \gamma v$, where $A \to \gamma$ is a production rule
We say a derivation is leftmost if every step $\alpha_i \Rightarrow \alpha_{i+1}$ must be of the forms:
$\alpha_i = u A v$, where $u \in \Sigma^*$, i.e. there is no non-terminal in $u$, and $A$ is a non-terminal
$\alpha_{i+1} = u \gamma v$, where $A \to \gamma$ is a production rule
We can define rightmost derivations likewise.
Now the first non-trivial result is that we can transform every derivation into a leftmost (or rightmost) derivation.
Leftmost (rightmost) derivations
Theorem 1: Suppose a CFG $G$. If $w$ is accepted by $G$, then there is a leftmost derivation such that $S \Rightarrow^* w$.
Proof: Since $w$ is accepted by $G$, there must be a derivation $S \Rightarrow^n w$. Here $n$ means the length of the derivation. We prove by induction on $n$.
Note we can't prove directly by induction on $n$ that $w$ has a leftmost derivation from $S$. Rather, we claim: for any string $\alpha$, any derivation $\alpha \Rightarrow^n w$ can be replaced by a leftmost derivation of the same length.
The case of $n = 1$ is obvious. Suppose the claim holds for $n$. Now consider $n + 1$, where the original derivation is $\alpha \Rightarrow^{n+1} w$. We can write it as $\alpha \Rightarrow \alpha_1 \Rightarrow^n w$.
If the step $\alpha \Rightarrow \alpha_1$ is leftmost, by the induction assumption there must be an equivalent leftmost derivation of the same length to replace $\alpha_1 \Rightarrow^n w$, so we are done.
If $\alpha \Rightarrow \alpha_1$ is not leftmost, then we consider $\alpha_1 \Rightarrow^n w$, which is already leftmost (by the induction assumption). So the derivation must be of the following form:
$\alpha = u A v B t$, where $u$ consists of terminals only, and $A$, $B$ are non-terminals with $A$ the leftmost one;
$\alpha \Rightarrow u A v \beta t$, where $B \to \beta$ is a production rule (the first step rewrites $B$, not $A$);
$u A v \beta t \Rightarrow u \gamma v \beta t$, where $A \to \gamma$ is a production rule (the next step is leftmost, so it rewrites $A$).
Now we swap the order of the two steps: $\alpha \Rightarrow u \gamma v B t \Rightarrow u \gamma v \beta t$ is still a valid derivation. Clearly its first step is leftmost now. For the remaining derivation of length $n$, by the induction assumption we can find an equivalent leftmost derivation. So the whole new derivation is leftmost. This completes the proof.
The idea of measuring a set in $\mathbb{R}$ is to create a function $m$ with three important properties:
Any interval $I$ is measurable, with $m(I)$ equal to its length
Additivity: if $E_1, E_2, \dots$ are mutually disjoint, then $m(\bigcup_k E_k) = \sum_k m(E_k)$
Translation Invariance: $m(E + y) = m(E)$, where $E + y$ means the set $\{x + y : x \in E\}$
It turns out that such a measure defined on all subsets of $\mathbb{R}$ is not possible. At best we can define a measure to cover as many sets as possible. This measure is the Lebesgue measure.
We start with constructing a weaker notion named Lebesgue outer measure. This measure covers all sets in $\mathbb{R}$ but does not possess the Additivity property. But it does have sub-additivity, i.e. if $E \subseteq \bigcup_k E_k$ then $m^*(E) \le \sum_k m^*(E_k)$.
For an open interval $I = (a, b)$, we use $\ell(I) = b - a$ to denote its length.
Definition 1 (Lebesgue Outer Measure): For any set $E$ in $\mathbb{R}$, its Lebesgue outer measure is defined as
$$m^*(E) = \inf\Big\{\sum_k \ell(I_k) : E \subseteq \bigcup_k I_k,\ I_k \text{ open intervals}\Big\}$$
That is, over all countable covers of $E$ by open intervals, we take the "minimum" total length.
You can easily prove that for any open interval $(a, b)$, closed interval $[a, b]$, or half-open-half-closed interval, its outer measure is $b - a$.
Lebesgue outer measure has the properties of translation invariance and sub-additivity.
Theorem 1 (Translation Invariance): For any set $E$ and any real number $y$, we have $m^*(E + y) = m^*(E)$.
Proof: Note that this is true if $E$ is an interval. We must have $m^*(E + y) \le m^*(E)$ because for any cover $\{I_k\}$ that covers $E$, the translated cover $\{I_k + y\}$ covers $E + y$ and has the same total length. The reverse inequality is also true, by translating back by $-y$.
Theorem 2 (Countable sub-additivity): If $E \subseteq \bigcup_k E_k$, then $m^*(E) \le \sum_k m^*(E_k)$.
Proof: Fix $\varepsilon > 0$. For each $E_k$, there exists an open cover $\{I_{k,j}\}_j$ of $E_k$ with $\sum_j \ell(I_{k,j}) \le m^*(E_k) + \varepsilon / 2^k$. Note $\{I_{k,j}\}_{k,j}$ covers $\bigcup_k E_k$ and also $E$. So $m^*(E) \le \sum_{k,j} \ell(I_{k,j}) \le \sum_k m^*(E_k) + \varepsilon$. Since this holds for any $\varepsilon > 0$, we must have $m^*(E) \le \sum_k m^*(E_k)$.
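As a quick worked consequence of Definition 1 together with Theorem 2 (a standard exercise, added here for illustration): every countable set has outer measure zero.

```latex
\text{Let } C = \{x_1, x_2, \dots\} \text{ and fix } \varepsilon > 0.
\text{ Cover each point by }
I_k = \Bigl(x_k - \frac{\varepsilon}{2^{k+1}},\; x_k + \frac{\varepsilon}{2^{k+1}}\Bigr),
\quad\text{so that}\quad \ell(I_k) = \frac{\varepsilon}{2^k}.
\text{ Then}
\quad
m^*(C) \;\le\; \sum_{k=1}^{\infty} \ell(I_k)
       \;=\; \sum_{k=1}^{\infty} \frac{\varepsilon}{2^k}
       \;=\; \varepsilon .
```

Since $\varepsilon$ was arbitrary, $m^*(C) = 0$; in particular $m^*(\mathbb{Q}) = 0$.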
Lebesgue Measure
Lebesgue measurable sets form a $\sigma$-algebra
We say a collection of subsets of $\mathbb{R}$ is a $\sigma$-algebra if it contains $\emptyset$ and $\mathbb{R}$, and is closed under both complementation and countable unions. The smallest $\sigma$-algebra that contains all open sets is called the Borel $\sigma$-algebra.
Lebesgue outer measure is defined on all subsets of $\mathbb{R}$. It does not have the countable additivity property. We will see that if we restrict the measure function to a suitable $\sigma$-algebra on $\mathbb{R}$ then we do have the desired property. After this restriction, these sets become "Lebesgue measurable".
Formally, a set $E$ is said to be Lebesgue measurable if for any set $A$, we have
$$m^*(A) = m^*(A \cap E) + m^*(A \cap E^c)$$
By sub-additivity, it is always true that $m^*(A) \le m^*(A \cap E) + m^*(A \cap E^c)$. So to prove a set $E$ is measurable, we only need to show for any set $A$:
$$m^*(A) \ge m^*(A \cap E) + m^*(A \cap E^c)$$
Now we will see most “normal” sets are measurable, and these sets form a -algebra.
If a set $E$ is measurable, we simply use $m(E)$ to denote its measure $m^*(E)$.
Theorem 3: If a set $E$ is measurable, so is its complement $E^c$.
Proof: Since $(E^c)^c = E$, for any set $A$ we have $m^*(A \cap E^c) + m^*(A \cap (E^c)^c) = m^*(A \cap E) + m^*(A \cap E^c) = m^*(A)$.
Theorem 4: Any set $E$ with outer measure $0$ is measurable.
Proof: Note for any set $A$, we have $m^*(A \cap E) = 0$ since $A \cap E \subseteq E$. To show $E$ is measurable, we only need to show $m^*(A) \ge m^*(A \cap E^c)$.
Notice $A \cap E^c \subseteq A$, so any cover of $A$ also covers $A \cap E^c$, giving $m^*(A \cap E^c) \le m^*(A)$. Thus $m^*(A) \ge m^*(A \cap E) + m^*(A \cap E^c)$, and $E$ is measurable.
Theorem 5: $\emptyset$ and $\mathbb{R}$ are both measurable.
Proof: Since $m^*(\emptyset) = 0$, $\emptyset$ is measurable by Theorem 4. By Theorem 3, $\mathbb{R} = \emptyset^c$ is also measurable.
Theorem 6 (Countable Additivity): If $E_1, E_2, \dots$ are disjoint and measurable, then $\bigcup_k E_k$ is also measurable and $m(\bigcup_k E_k) = \sum_k m(E_k)$. This is true for both finite and countably infinite collections.
Proof: We first prove the finite case, by induction on the number of sets. Consider $E_1 \cup E_2$, where $E_1, E_2$ are disjoint and measurable. For any set $A$, we let $A_1 = A \cap E_1$, $A_2 = A \cap E_2$, and $A_0 = A \cap (E_1 \cup E_2)^c$. To show $E_1 \cup E_2$ is measurable, we need to show
$$m^*(A) \ge m^*(A_1 \cup A_2) + m^*(A_0)$$
Since $E_1$ is measurable, we have $m^*(A) = m^*(A \cap E_1) + m^*(A \cap E_1^c)$. Since $E_2$ is measurable, we have
$m^*(A \cap E_1^c) = m^*(A \cap E_1^c \cap E_2) + m^*(A \cap E_1^c \cap E_2^c) = m^*(A_2) + m^*(A_0)$, using the disjointness of $E_1, E_2$. So we have $m^*(A) = m^*(A_1) + m^*(A_2) + m^*(A_0)$. It remains to note $m^*(A_1 \cup A_2) \le m^*(A_1) + m^*(A_2)$, which is true by the outer measure's sub-additivity. Hence $E_1 \cup E_2$ is measurable.
Again, taking $A = E_1 \cup E_2$ in the displayed equality, we get $m(E_1 \cup E_2) = m(E_1) + m(E_2)$. This proves the theorem for two sets. It is easy to extend it to finitely many sets by induction.
To show $E = \bigcup_{k=1}^{\infty} E_k$ is measurable, consider any set $A$ and let $F_n = \bigcup_{k=1}^{n} E_k$. Each $F_n$ is measurable by the finite case, so
$$m^*(A) = m^*(A \cap F_n) + m^*(A \cap F_n^c) \ge \sum_{k=1}^{n} m^*(A \cap E_k) + m^*(A \cap E^c)$$
using $F_n^c \supseteq E^c$ and the finite additivity established above. Letting $n \to \infty$ and applying countable sub-additivity,
$$m^*(A) \ge \sum_{k=1}^{\infty} m^*(A \cap E_k) + m^*(A \cap E^c) \ge m^*(A \cap E) + m^*(A \cap E^c)$$
This proves that $E$ is measurable.
Finally, to show $m(E) = \sum_k m(E_k)$, note that for any $n$ we must have $m(E) \ge m(F_n) = \sum_{k=1}^{n} m(E_k)$, hence $m(E) \ge \sum_{k=1}^{\infty} m(E_k)$. Combining this with sub-additivity ($m(E) \le \sum_k m(E_k)$), we complete the proof.
Theorem 7 (closed under countable unions): If $E_1, E_2, \dots$ are measurable, $\bigcup_k E_k$ is also measurable.
Proof: This is easily seen from Theorem 6. Simply construct $F_1 = E_1$ and $F_k = E_k \setminus (E_1 \cup \cdots \cup E_{k-1})$ for $k \ge 2$. Then the $F_k$ are disjoint. By Theorems 3 and 6 (complements and finite unions, hence finite intersections), they are all measurable. Again by Theorem 6, $\bigcup_k E_k = \bigcup_k F_k$ is also measurable.
Theorem 8: Any interval of the form $(a, \infty)$ is measurable.
Proof: Consider a set $A$. We may assume $a \notin A$; otherwise we can replace $A$ by $A \setminus \{a\}$, which changes no outer measure. Let $A_1 = A \cap (a, \infty)$ and $A_2 = A \cap (-\infty, a)$.
Consider an open cover $\{I_k\}$ of $A$. For each $k$ let $I_k' = I_k \cap (a, \infty)$ and $I_k'' = I_k \cap (-\infty, a)$. Each is an open interval (or empty) and $\ell(I_k) = \ell(I_k') + \ell(I_k'')$. $\{I_k'\}$ covers $A_1$ and $\{I_k''\}$ covers $A_2$.
Thus,
$$\sum_k \ell(I_k) = \sum_k \ell(I_k') + \sum_k \ell(I_k'') \ge m^*(A_1) + m^*(A_2)$$
This holds for any open cover $\{I_k\}$. Hence we have $m^*(A) \ge m^*(A_1) + m^*(A_2)$.
With Theorem 7 and the fact that measurable sets form a $\sigma$-algebra, we have
Theorem 8': Any Borel set is measurable.
Proof: This is easily shown by combining Theorems 3, 7, and 8: the measurable sets form a $\sigma$-algebra containing all intervals $(a, \infty)$, hence containing the whole Borel $\sigma$-algebra they generate.
Approximation
One of Littlewood’s principles states: a Lebesgue measurable set is almost open.
We shall construct several approximations of a measurable set by open and closed sets.
Lemma 9 (Excision Property): If $E$ is measurable with finite measure, then for any set $B$ containing $E$ we have $m^*(B \setminus E) = m^*(B) - m(E)$.
Proof: Note $B \setminus E = B \cap E^c$ and $B \cap E = E$ since $E \subseteq B$. Since $E$ is measurable, $m^*(B) = m^*(B \cap E) + m^*(B \cap E^c) = m(E) + m^*(B \setminus E)$. Finally, since $m(E)$ is finite, we can subtract it from both sides and obtain the result as desired.
Lemma 10: If $E$ is measurable with $m(E) = \infty$, then $E = \bigcup_k E_k$ where the $E_k$ are disjoint, measurable, of finite measure, and $m(E) = \sum_k m(E_k)$.
Proof: Take $E_1 = E \cap [-1, 1]$ and $E_k = E \cap ([-k, k] \setminus [-(k-1), k-1])$ for $k \ge 2$. Then each $E_k$ satisfies all the properties. By countable additivity, we obtain what we desire.
Theorem 11 (Outer and Inner Approximation): Let $E$ be a set of real numbers. Then each of the following statements is equivalent to the measurability of $E$.
Outer approximation:
(1) For any $\varepsilon > 0$, there is an open set $O$ containing $E$ for which $m^*(O \setminus E) < \varepsilon$
(2) There is a $G_\delta$ set, i.e. a countable intersection of open sets, containing $E$ for which $m^*(G \setminus E) = 0$
Inner approximation:
(3) For any $\varepsilon > 0$, there is a closed set $F$ contained in $E$ for which $m^*(E \setminus F) < \varepsilon$
(4) There is an $F_\sigma$ set, i.e. a countable union of closed sets, contained in $E$ for which $m^*(E \setminus F) = 0$.
Proof: We show that measurability implies (1), that (1) implies (2), and that (2) implies measurability; this establishes the equivalence of measurability with the outer approximations. The inner approximations then follow by complementation.
Suppose $E$ is measurable. First we assume $m(E) < \infty$. For any $\varepsilon > 0$ there is an open cover $\{I_k\}$ of $E$ such that $\sum_k \ell(I_k) < m(E) + \varepsilon$. Let $O = \bigcup_k I_k$. Then $O$ is open, $E \subseteq O$, and $m(O) < m(E) + \varepsilon$. By the Excision property, we have $m^*(O \setminus E) = m(O) - m(E) < \varepsilon$.
Now suppose $m(E) = \infty$. By Lemma 10, $E = \bigcup_k E_k$ where the $E_k$ are disjoint and measurable with finite measures. Then (1) holds for each $E_k$ with some open $O_k \supseteq E_k$ and $m^*(O_k \setminus E_k) < \varepsilon / 2^k$. So by taking $O = \bigcup_k O_k$, we have $m^*(O \setminus E) \le \sum_k m^*(O_k \setminus E_k) < \varepsilon$.
To see (1) implies (2): for each $k$ there is an open $O_k \supseteq E$ for which $m^*(O_k \setminus E) < 1/k$. Take $G = \bigcap_k O_k$. Then $G$ is a $G_\delta$ set containing $E$ and $m^*(G \setminus E) \le m^*(O_k \setminus E) < 1/k$ for every $k$. Hence $m^*(G \setminus E) = 0$.
To see (2) implies measurability, note $G \setminus E$ is measurable since its outer measure is $0$ (Theorem 4). Also $G$ is measurable, being a Borel set. So $E = G \setminus (G \setminus E)$ is measurable.
Finally, for the inner approximations, apply the outer results to the measurable set $E^c$: an open $O \supseteq E^c$ with $m^*(O \setminus E^c) < \varepsilon$ yields the closed set $F = O^c \subseteq E$ with $E \setminus F = E \cap O \subseteq O \setminus E^c$, so $m^*(E \setminus F) < \varepsilon$, which is (3); similarly a $G_\delta$ around $E^c$ yields the $F_\sigma$ of (4), and conversely (3) or (4) for $E$ gives (1) or (2) for $E^c$, hence measurability of $E^c$ and thus of $E$.
Now we can show one of Littlewood's principles.
Theorem 12: If $E$ is measurable with finite measure, then for any $\varepsilon > 0$ there is a finite union of open intervals $U$ such that $m^*(U \,\triangle\, E) < \varepsilon$.
Proof: By Theorem 11, there is an open set $O \supseteq E$ s.t. $m^*(O \setminus E) < \varepsilon / 2$. Since $E$ is of finite measure, so is $O$ (by the Excision property). $O$ can be expressed as the union of a countable collection of disjoint open intervals, i.e. $O = \bigcup_k I_k$. Since $\sum_k \ell(I_k) = m(O) < \infty$, there is some $n$ s.t. $\sum_{k > n} \ell(I_k) < \varepsilon / 2$.
Now take $U = \bigcup_{k=1}^{n} I_k$. Then $U \setminus E \subseteq O \setminus E$ and thus $m^*(U \setminus E) < \varepsilon / 2$. On the other hand, since $E \setminus U \subseteq O \setminus U \subseteq \bigcup_{k > n} I_k$, we have $m^*(E \setminus U) < \varepsilon / 2$. Thus $m^*(U \,\triangle\, E) < \varepsilon$.
Continuity
The idea is that if a sequence of measurable sets increases (or decreases) to a limit set, then their measures converge to the measure of the limit. Formally, we have the following theorems.
Theorem 13: Suppose $\{E_k\}$ is an ascending sequence of measurable sets, i.e. $E_k \subseteq E_{k+1}$ for every $k$. Then $m(\bigcup_k E_k) = \lim_{k \to \infty} m(E_k)$.
Proof: If $m(E_{k_0}) = \infty$ for some $k_0$, then both sides are $\infty$ and the theorem holds. So we may assume all $m(E_k) < \infty$.
Let $E_0 = \emptyset$ and $F_k = E_k \setminus E_{k-1}$ for $k \ge 1$. Then the $F_k$ are disjoint and $\bigcup_k F_k = \bigcup_k E_k$. Thus,
$$m\Big(\bigcup_k E_k\Big) = \sum_{k=1}^{\infty} m(F_k) = \sum_{k=1}^{\infty} \big(m(E_k) - m(E_{k-1})\big)$$
The partial sums of the right hand side telescope to $m(E_n)$. Hence we have the desired result.
Theorem 14: Suppose $\{E_k\}$ is a descending sequence of measurable sets with $m(E_1) < \infty$, i.e. $E_{k+1} \subseteq E_k$ for every $k$. Then $m(\bigcap_k E_k) = \lim_{k \to \infty} m(E_k)$.
Proof: Let $F_k = E_1 \setminus E_k$. Since $\{E_k\}$ is descending, $\{F_k\}$ is ascending. So $\bigcup_k F_k = E_1 \setminus \bigcap_k E_k$, and by Theorem 13,
$$m\Big(E_1 \setminus \bigcap_k E_k\Big) = \lim_{k \to \infty} m(E_1 \setminus E_k)$$
By the Excision property, this becomes $m(E_1) - m(\bigcap_k E_k) = m(E_1) - \lim_k m(E_k)$.
Since $m(E_1) < \infty$, we can cancel it and obtain the result as desired.
For a measurable set $E$, we say a property holds almost everywhere (a.e. for short) in $E$ if there is a measurable subset $E_0 \subseteq E$ s.t. $m(E_0) = 0$ and the property holds for all $x \in E \setminus E_0$.
With these continuity theorems, we can prove the Borel-Cantelli Lemma.
Borel-Cantelli Lemma: Suppose $\{E_k\}$ is a collection of measurable sets with $\sum_k m(E_k) < \infty$. Then almost every $x$ appears in only finitely many of the $E_k$.
Proof: In other words, almost no $x$ appears infinitely often among the $E_k$. This is equivalent to showing $m\big(\bigcap_n \bigcup_{k \ge n} E_k\big) = 0$. Note $m(\bigcup_{k \ge 1} E_k) \le \sum_k m(E_k) < \infty$, and the sets $\bigcup_{k \ge n} E_k$ are descending in $n$. So by Theorem 14, $m\big(\bigcap_n \bigcup_{k \ge n} E_k\big) = \lim_n m\big(\bigcup_{k \ge n} E_k\big) \le \lim_n \sum_{k \ge n} m(E_k) = 0$, since the tail of a convergent series tends to $0$.
Let $\Sigma$ be an alphabet and $L$ be any language over it. We can define an equivalence relation $\equiv_L$ on the set of all strings over $\Sigma$ as:
Strings $x \equiv_L y$ iff for any string $z$, either both $xz$ and $yz$ are in $L$, or neither of them is in $L$
To see why $\equiv_L$ is an equivalence relation, check:
Reflexivity ($x \equiv_L x$): this is obvious, you can consider any $z$
Symmetry (if $x \equiv_L y$, then $y \equiv_L x$): this is also obvious
Transitivity (if $x \equiv_L y$ and $y \equiv_L z$, then $x \equiv_L z$): consider a string $w$. Assume $xw \in L$ but $zw \notin L$. But then we must have $yw \in L$ because $x \equiv_L y$, and $yw \notin L$ because $y \equiv_L z$, which leads to a contradiction
An equivalence relation partitions the strings into equivalence classes $[x]$. You should be convinced of the following facts:
Every string belongs to exactly one equivalence class, i.e. $\Sigma^* = \bigcup_x [x]$ and distinct classes are disjoint.
It is right-invariant, i.e. if $x \equiv_L y$, then for any $z$ we have $xz \equiv_L yz$.
$L$ must be a union of some equivalence classes.
You can see that if two strings $x \equiv_L y$ then they are "indistinguishable", because adding any suffix to them will make them either both in $L$ or both not in $L$.
The above fact 2 is not trivial. Let's prove a lemma about it.
Lemma 1: The equivalence relation $\equiv_L$ is right-invariant, i.e. for any strings $x, y$, if $x \equiv_L y$ then for any string $z$ we must have $xz \equiv_L yz$.
Proof: Consider any string $w$. Since $zw$ is also a string, by the definition of $x \equiv_L y$ we have $xzw \in L \iff yzw \in L$. Since $w$ was arbitrary, $xz$ and $yz$ are indistinguishable, i.e. $xz \equiv_L yz$.
Fact 3 is not difficult; you can quite easily prove it.
Lemma 2: $L$ must be a union of some equivalence classes.
Proof: Suppose it is not the case. Then there is some class $[x]$ and strings $y, z \in [x]$ such that $y \in L$ but $z \notin L$. But since $y \equiv_L z$, by taking the empty suffix we must have $y, z$ both in $L$ or both not in $L$, which leads to a contradiction.
Myhill-Nerode Theorem
This theorem essentially characterizes regular languages in another way. For any regular language, the set of equivalence classes defined by $\equiv_L$ is finite. The converse is also true.
Theorem 3 (Myhill-Nerode): $L$ is a regular language if and only if the set of equivalence classes defined by $\equiv_L$ is finite.
Proof: (Left to Right). Suppose $L$ is regular. Then there must exist a DFA $D$ that accepts it. We can define an equivalence relation $\equiv_D$ based on $D$ as this: $x \equiv_D y$ iff $D$ ends in the same state after reading $x$ or $y$. (Note: if for some state $q$ and some symbol $a$ the transition is not defined, you can always create some "dead" state to absorb it, with self loops that never let it out. The resulting automaton is still a DFA.)
It can be easily checked that
$\equiv_D$ is indeed an equivalence relation
$L$ is the union of the classes of $\equiv_D$ that correspond to final states
Now you can also prove that $\equiv_D$ is a refinement of $\equiv_L$, meaning whenever $x \equiv_D y$, we must have $x \equiv_L y$. As a result, the number of equivalence classes of $\equiv_L$ is no greater than that of $\equiv_D$. Since the latter must be finite, the former is also finite. This completes the proof of the first part.
(Right to Left). Suppose $\equiv_L$ yields a finite partition with its equivalence classes. By Lemma 1, we know $\equiv_L$ is also right-invariant. Now construct a DFA as follows:
The states are the equivalence classes of $\equiv_L$
The start state is the class $[\varepsilon]$ of the empty string
Final states are the equivalence classes whose union forms $L$. By Lemma 2, we know this is well defined
For the transition function, we let $\delta([x], a) = [xa]$, i.e. reading symbol $a$ from the class of $x$ moves to the class of $xa$
Now we show that this DFA accepts exactly the language $L$, thus establishing that $L$ is regular. We let $[x]$ denote the class containing $x$, and $\hat\delta$ the extended transition function on strings. We will show $\hat\delta([\varepsilon], x) = [x]$ for every string $x$, so that $x$ is accepted iff $x \in L$.
We use induction on the length of $x$. Since $\hat\delta([\varepsilon], \varepsilon) = [\varepsilon]$, the base case holds by definition. Assume the claim holds for length $n$. Now consider $xa$ with $|x| = n$ and $a$ a symbol. By the induction assumption, $\hat\delta([\varepsilon], x) = [x]$. By the definition of $\delta$, $\delta([x], a) = [xa]$. This proves $\hat\delta([\varepsilon], xa) = [xa]$.
It remains to check that $\delta$ is well defined: if $[x] = [y]$, we need $[xa] = [ya]$, i.e. the transition must not depend on which representative of the class we pick. This is exactly right-invariance: if $x \equiv_L y$ then $xa \equiv_L ya$. Finally, $x$ is accepted iff $\hat\delta([\varepsilon], x) = [x]$ is a final state iff $[x] \subseteq L$ (Lemma 2) iff $x \in L$. This completes the proof.
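The theorem can be probed experimentally. A hedged sketch (the helper `signatures` below is a hypothetical name; it only approximates the Nerode classes by testing finitely many strings against finitely many suffixes, so it can at best undercount the classes):

```python
from itertools import product

def signatures(in_lang, alphabet, max_len=4, suf_len=3):
    """Group strings by their 'signature': which test suffixes z put wz in L.
    Equal signatures are consistent with w =_L w'; distinct signatures prove
    the strings lie in different Nerode classes."""
    words = [""] + ["".join(p) for n in range(1, max_len + 1)
                    for p in product(alphabet, repeat=n)]
    sufs = [""] + ["".join(p) for n in range(1, suf_len + 1)
                   for p in product(alphabet, repeat=n)]
    sig = lambda w: tuple(in_lang(w + z) for z in sufs)
    return {sig(w) for w in words}

# L = strings over {a, b} with an even number of a's: exactly 2 Nerode classes.
even_a = lambda w: w.count("a") % 2 == 0
print(len(signatures(even_a, "ab")))  # 2
```

For this language the signature of $w$ depends only on the parity of its $a$'s, so the sampled lower bound of 2 already matches the true (finite) number of classes, and the Myhill-Nerode DFA built from them has 2 states.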